1. INTRODUCTION

Buying and selling through e-commerce has become a common lifestyle behaviour in today’s society because
people can buy products without physically visiting a store, which reduces the effort of shopping offline.
The focus of e-commerce today is to maximize shopping efficiency with strategies such as convenient product
search, one-click purchases, and virtual catalogues that offer specifications and recommendations based on
consumers’ past shopping behaviour [1]. As a result, the number of e-commerce consumers has increased and
electronic word of mouth has emerged. Electronic word of mouth (eWOM) refers to consumer feedback and points
of view regarding products or services, which can take the form of votes, comments, ratings, reviews, or blog
posts [2], [3]. Because of eWOM, consumers are more interested in products discussed online than in products
discussed traditionally (offline), since eWOM is a richer source of objective information [4]. One type of
eWOM is the product review. Product reviews are textual reviews from consumers that describe characteristics
such as the advantages or disadvantages of a product [5]. Product reviews are important because they affect
consumers’ purchasing decisions through other consumers’ accounts of the product’s attributes, usage
situations, and performance [6]. Even so, product reviews lose their value if the information they contain is
not helpful. This can be addressed by classifying reviews as helpful or not based on the features
(characteristics) of the reviews.
Researchers have proposed various methodologies for classifying reviews as helpful or not, most of which
focus on feature engineering. Krishnamoorthy [7] proposed the use of linguistic features, review metadata,
readability, and subjectivity. After a pre-processing stage that deleted duplicate reviews, reviews with a
low total vote, and reviews without content, the study obtained f-measure values of 0.614 for naive Bayes,
0.753 for support vector machine (SVM), and 0.778 for random forest. Saumya et al. [8] proposed the novelty of
using product descriptions and customer question-answers for feature extraction. By applying the synthetic
minority over-sampling technique (SMOTE), the naive Bayes, SVM, and random forest models yielded F1-scores of
0.565, 0.805, and 0.87, respectively. Unfortunately, that study explained the pre-processing stages only
briefly, such as removing unicode characters and images and reducing reviews that had high votes. Du et al.
[9] proposed the use of semantic and sentiment features; at the pre-processing stage, reviews that were not
in English, were duplicates, or had a total vote of less than 10 were deleted, and the remaining reviews were
converted to lowercase and split into word tokens. The final results show that semantic features are superior,
with a highest accuracy of 0.81. Akbarabadi and Hosseini [10] proposed the novelty of using reviewer and title
characteristics as features. With a pre-processing stage that deleted reviews with a total vote of less than
10 and took into account the proportion of the helpfulness ratio received by each review, these two features
achieved the highest f-measure values of 0.96 for the random forest model and 0.88 for the decision tree
model. All of these studies used Amazon’s product review dataset in their tests. Other research uses
different datasets and approaches, such as Ma et al. [11], who use review text and photos as features and
Yelp and TripAdvisor as datasets. Their proposed deep learning model outperforms the baseline models
(decision tree, SVM, and logistic regression) with a highest F1-score of 0.79. Luo and Xu [12] proposed a new
approach, conducting an exploratory analysis using semantic, sentiment, and latent Dirichlet allocation
features, which was tested on the Yelp dataset. The combination of SVM and fuzzy domain ontology produced
the highest F1-score, 0.795, compared to other models.

The studies described above still do not explain in detail or completely how the preprocessing stages were
carried out before classifying whether a review is helpful or not. Meng et al. [13], who also use the Amazon
product review dataset, propose in their use of structural features to look not only at the character or word
features of a review text but also at the relationship or meaning of the characters or words themselves. This
work proposes the use of structural features, namely features that focus on the structure of a word, and
semantic features, namely features that focus on the meaning of the word itself [14], in its feature
extraction. Each process of the preprocessing stage is explained in detail so that the correct procedure for
obtaining a clean review is clear. For the classification model, SVM is used based on its performance in
previous studies [15]-[17]. As a further contribution, this study also combines the two features to find out
whether the combination affects the classification results of the model.


2. METHOD 

To answer the problems described above, the proposed research stages are shown in Figure 1. Figure 1 shows
the six main stages of classifying whether a review is helpful or not based on the proposed features: data
collection, data labelling, data pre-processing, feature extraction, modelling, and model evaluation.
Figure 1. Proposed method
A summary of the six main stages is as follows. In data collection, data are collected according to the
research being carried out. Because this study concerns product reviews, the data needed are also in the form
of reviews. In data labelling, the collected data are given a label that states their ground truth. For
classification problems, the ground truth is important not only for training the model but also for validating
its results. In data preprocessing, the data are processed to obtain clean data, for example by removing
punctuation and symbols or expanding abbreviations. This is necessary because a model tends to produce better
classification performance when the data are clean. In feature extraction, the clean data are transformed
into a form that the machine learning model can understand. In modelling, the classification model is trained
on the extracted features. In the final stage (model evaluation), the model’s performance is evaluated based
on the classification results obtained. Each stage is explained in detail in its own subsection.


2.1. Data collection 

This study uses the 2015 Amazon product review dataset, chosen because of its use in previous studies. It is
an open dataset obtained from the official website [18]. The dataset contains 15 variables, but this study
uses only the six that are needed in the subsequent stages (data labelling, data preprocessing, and feature
extraction): customer id, product id, helpful votes, total votes, verified purchase, and review body. The
values and descriptions of the six variables are shown in Table 1.


Table 1. Research data variables 


Variable name | Example value | Description
customer_id | 302120, 445 | Random identifier that can be used to aggregate reviews written by one customer.
product_id | ‘BOOMUTIDK’ | Unique id of a product.
helpful_votes | 12, 8, 24 | Total helpful votes received by a review.
total_votes | 8, 23, 46 | Total votes, both helpful votes (likes) and unhelpful votes (dislikes), received by a review.
verified_purchase | ‘Y’ or ‘N’ | ‘Y’ if the customer has been verified to buy the product directly on Amazon without being given an excessive discount; ‘N’ if the customer buys the product indirectly through Amazon and does not pay the price available for most buyers on Amazon.
review_body | ‘the product is so good’ | The contents of the review.


Because the assessment of product quality differs between product categories [19], this research uses two
different product categories, namely search products and experience products. A search product is one whose
quality consumers can evaluate before buying it [20], whereas an experience product is one that consumers
cannot evaluate before buying, so they must try it first [21]. The video games dataset represents the search
product, while the beauty product dataset represents the experience product.


2.2. Data labelling 

In data labelling, each review is given a label that indicates whether it is a helpful review or not. One
indicator of a helpful review is the number of helpful votes it has received: the more helpful votes, the more
likely the review is helpful [22]. Following Ghose and Ipeirotis [23], who also use the Amazon product review
dataset, a review is labelled as helpful if its helpful ratio is greater than 0.6. The helpful ratio is
defined in (1) [24].


$$\text{helpful\_ratio} = \frac{\text{helpful\_votes}}{\text{total\_votes}} \quad (1)$$

This threshold of 0.6 has since been used in many similar studies that classify Amazon product reviews as
helpful or not. Reviews must also have a total vote count of more than 10, because reviews with few total
votes can be biased and unreliable [25].
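
The labelling rule can be illustrated with a minimal Python sketch. The function name, the constants, and the choice to return None for excluded reviews are assumptions made for illustration; only the 0.6 threshold and the requirement of more than 10 total votes come from the text.

```python
from typing import Optional

HELPFUL_RATIO_THRESHOLD = 0.6  # threshold stated in the text
MIN_TOTAL_VOTES = 10           # a review needs MORE than this many total votes


def label_review(helpful_votes: int, total_votes: int) -> Optional[str]:
    """Return 'helpful' or 'unhelpful', or None when the review has too few votes."""
    if total_votes <= MIN_TOTAL_VOTES:
        return None  # excluded: small vote counts are treated as unreliable
    helpful_ratio = helpful_votes / total_votes  # equation (1)
    if helpful_ratio > HELPFUL_RATIO_THRESHOLD:
        return "helpful"
    return "unhelpful"
```

For example, a review with 18 helpful votes out of 23 has a ratio of about 0.78 and is labelled helpful, while one with 10 out of 27 (about 0.37) is labelled unhelpful.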


2.3. Data preprocessing 

Data preprocessing is needed to obtain reviews that are clean and whose words have been restored. Cleaning and
restoring the words in a review is important because it provides a strong foundation for the subsequent
modelling stage [26]. The following processes are carried out in data preprocessing:
- Data cleansing: Data cleansing yields reviews that are reliable and structurally clean. Reviews with
incomplete or blank data, reviews from consumers who did not buy the product, and duplicate reviews are
removed. The review text is then converted to a normal form, uniform resource locators (URLs) and
hypertext markup language (HTML) tags are removed, abbreviations and slang words are expanded, and
punctuation, symbols, and numbers are removed.

- Tokenization: Tokenization separates sentences into word tokens, which are used in the next process,
stopword removal.

- Stopword removal: Stopword removal removes common words in the review language, namely English. These
common words (stopwords) appear frequently in many parts of the text in a document but carry little
information about which part of the text they belong to [27]. Because they can affect the performance of the
model [28], they need to be removed.

- Lemmatization: Lemmatization changes each word to its base form while preserving the content and meaning
of the word.

- Spelling correction: Each word is first checked against the English word dictionary (WordNet). A word that
is not in WordNet may be misspelled and needs to be corrected.
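
The text-level steps above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not the study's implementation: the SLANG and STOPWORDS tables are hypothetical and deliberately tiny, and lemmatization and WordNet-based spelling correction are omitted because they need external language resources.

```python
import re
from typing import List

# Hypothetical, deliberately small lookup tables for illustration only.
SLANG = {"gr8": "great"}
STOPWORDS = {"a", "an", "and", "the", "is", "in", "it", "or", "no", "so", "i",
             "for", "my", "to", "on"}


def clean_review(text: str) -> List[str]:
    """Lowercase, strip URLs/HTML, expand slang, drop non-letters and stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags such as <br/>
    # Expand slang before digits are stripped, otherwise "gr8" would become "gr".
    expanded = [SLANG.get(tok.strip(".,!?"), tok) for tok in text.split()]
    text = re.sub(r"[^a-z\s]", " ", " ".join(expanded))  # punctuation, symbols, numbers
    tokens = text.split()                                # tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]  # stopword removal
```

The ordering matters in one place: slang such as "gr8" has to be expanded before numbers are removed, which is why the expansion runs on the raw tokens.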


2.4. Feature extraction 

Feature extraction is the process of transforming the features in a text into quantitative data structures,
such as numbers, that serve as input to model training. For semantic features, feature extraction aims to
capture the meaning of the words in a text. In this study, the term frequency-inverse document frequency
(TF-IDF) [29] method is used for the semantic features, with term frequency (TF) defined in (2).


$$TF_{ij} = \frac{f_{ij}}{\max_k f_{kj}} \quad (2)$$


Given a collection of N documents, $f_{ij}$ in TF is the number of occurrences of word i in document j, and
$\max_k f_{kj}$ is the maximum number of occurrences of any word k in document j. The inverse document
frequency (IDF) of word i is defined in (3).


$$IDF_i = \log \frac{N}{N_i} \quad (3)$$


where N is the number of documents in the corpus and $N_i$ is the number of those N documents that contain
word i. The value obtained from this calculation lies between zero and one, with the largest values
approaching one.
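
To make (2) and (3) concrete, the following Python sketch computes a weight for every word in every document, taking the TF-IDF weight as the product TF × IDF. It is a direct transcription of the two formulas under stated assumptions: the function name is hypothetical, the logarithm base (natural log here) is not specified in the text, and library implementations often add smoothing and normalization, so their values may differ.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {word: weight} mapping per tokenized document."""
    n_docs = len(documents)  # N
    # N_i: number of documents that contain word i at least once
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)                         # f_ij
        max_count = max(counts.values(), default=1)   # max_k f_kj
        weights.append({
            word: (count / max_count) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights
```

A word that occurs in every document gets an IDF of log(1) = 0, so it contributes nothing to the representation regardless of how often it appears.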

Next, structural feature extraction is the process of extracting features that describe the structure and
format of a text document. In this study, it consists of counting the number of words and the number of
sentences in the review. The resulting values can be in the units, tens, or even hundreds. The flow of the
calculation is shown in Figure 2, where the review body serves as the primary variable (x) and is processed to
determine the number of words (x1) and the number of sentences (x2). To count the words, the review body is
first cleaned of punctuation, symbols, and numbers and then transformed into word tokens, which are counted.
To count the sentences, the review body is passed to the sent_tokenize function from the natural language
toolkit (NLTK) library. This function treats a period as the end of a sentence while considering other factors
that may affect sentence identification. The number of sentences produced is then counted.


[Figure 2 flow. Word path: x = review_body → cleaned of punctuation, symbols, and numbers → tokenization using
Python's string split() → count the number of items in the list → x1 = word count. Sentence path:
x = review_body → tokenization using NLTK's sent_tokenize() → count the number of items in the list →
x2 = number of sentences.]


Figure 2. Structural feature extraction flow 
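
The two paths of the flow can be sketched as a single Python function. To keep the example self-contained, a naive regular-expression splitter stands in for NLTK's sent_tokenize(); this substitution is an assumption for illustration and will disagree with NLTK on abbreviations and similar cases. The function name is also hypothetical.

```python
import re
from typing import Tuple


def structural_features(review_body: str) -> Tuple[int, int]:
    """Return (x1, x2) = (number of words, number of sentences)."""
    # x1: remove punctuation, symbols, and numbers, then split on whitespace
    letters_only = re.sub(r"[^A-Za-z\s]", " ", review_body)
    word_count = len(letters_only.split())
    # x2: naive stand-in for sent_tokenize(): split after '.', '!' or '?'
    parts = re.split(r"(?<=[.!?])\s+", review_body.strip())
    sentence_count = len([part for part in parts if part])
    return word_count, sentence_count
```

Note that the word path works on the cleaned text while the sentence path works on the raw review body, since sentence boundaries are lost once punctuation is removed.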


Indonesian J Elec Eng & Comp Sci, Vol. 32, No. 3, December 2023: 1495-1502 




As previously explained, this research also combines the two features. Because the two extracted features
have different value scales, a standard scaler is used in the merger to bring their values onto a common
scale. The standardization formula of the standard scaler is given in (4), where x is the original feature
value, μ is the mean, and σ is the standard deviation.


$$z = \frac{x - \mu}{\sigma} \quad (4)$$
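
As a small illustration of (4), the sketch below standardizes one feature column in plain Python. The function name is hypothetical, and the use of the population standard deviation and the handling of a constant column are assumptions rather than details given in the text.

```python
from statistics import fmean, pstdev
from typing import List


def standardize(values: List[float]) -> List[float]:
    """Apply z = (x - mu) / sigma to every value of one feature column."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (assumed)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(x - mu) / sigma for x in values]
```

After this step a word-count column with values in the tens or hundreds and a TF-IDF column with values near zero both have mean 0 and unit variance, which is what allows them to be placed side by side.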


2.5. Modelling 

At the modelling stage, three feature extraction scenarios are used to determine how well the model can
predict which reviews are helpful. In the first scenario the model uses semantic features, in the second it
uses structural features, and in the last it uses a combination of the two. The modelling process uses k-fold
cross validation with k equal to 10.
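
The splitting behind 10-fold cross validation can be sketched as follows. This is a minimal illustration with a hypothetical function name; it assigns contiguous index ranges to folds and does not shuffle, whereas whether the study shuffled or stratified the folds is not stated.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size
```

Each sample is used for testing exactly once and for training k - 1 times; the per-fold scores are then averaged, which is how a result of the form mean ± deviation arises.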


2.6. Model evaluation 
To evaluate the model, this study uses the F1-score as the evaluation metric. The F1-score balances precision
and recall and ranges from zero to one. It combines precision (5) and recall (6) as in (7).
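$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \quad (5)$$

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \quad (6)$$

$$\text{F1-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \quad (7)$$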
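
Equations (5) to (7) translate directly into code. The sketch below is illustrative; the function name and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def f1_score(true_positive: int, false_positive: int, false_negative: int) -> float:
    """Compute the F1-score from confusion-matrix counts using (5), (6), and (7)."""
    predicted_positive = true_positive + false_positive
    actual_positive = true_positive + false_negative
    if predicted_positive == 0 or actual_positive == 0:
        return 0.0  # precision or recall undefined; report 0.0 by convention
    precision = true_positive / predicted_positive  # equation (5)
    recall = true_positive / actual_positive        # equation (6)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)  # equation (7)
```

For instance, 6 true positives with 2 false positives and 4 false negatives give a precision of 0.75, a recall of 0.6, and an F1-score of about 0.667.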


3. RESULTS AND DISCUSSION 

In this study, Amazon product review datasets were obtained for the video games and beauty categories, where
each dataset has 1,048,576 rows and 15 variables. After obtaining the data, the next step is to label each
review. As previously explained, labelling is done by looking at the number of helpful votes and the total
votes received by the review. Examples of the results after data labelling and preprocessing are shown in
Table 2.


Table 2. Example of review labelling and preprocessing results 


Helpful votes | Total votes | Review body | Helpful ratio | Label | Clean review
10 | 27 | Very good, inexpensive brush! Bought 5 for my wife. On the last one and amp ready to re-order. Can’t beat the price https://www. youtube/nV HP49gSIPQ. | 0.37 | unhelpful | good inexpensive brush buy wife last one ready order beat price
18 | 23 | Who pays 4 dollars more for a $20 gift card? <br/>What store doesn’t sell gift cards that the extra 4 dollars sounds like a good idea? | 0.78 | helpful | pay dollar gift card store sell gift card extra dollar sound like good idea
12 | 12 | Fun game, fast delivery. <br/>No problems or complaints. Nice aqnd fast delivery. Game is in excellent condition. Brand new i believe. So, it is gr8 | 1 | helpful | fun game fast delivery problem complaint nice and fast delivery game excellent condition brand new believe great


The labelling follows the research methodology: a review is labelled as helpful if the ratio of helpful votes
to total votes is more than 0.6 and the total votes exceed 10. The modelling results are as follows. The model
was tested with three scenarios: the first using semantic features, the second using structural features, and
the third using combined features (semantic and structural features together). Because class imbalance is not
the focus of this study, the review data were class balanced by manually equating the number of positive
(helpful) and negative (unhelpful) reviews. Using 8,000 reviews, the modelling results are shown in Table 3.
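
One simple way to equate the class sizes is random undersampling, sketched below. The text only says the classes were equated manually, so the sampling strategy, the function name, the (text, label) tuple layout, and the seed are all assumptions made for illustration.

```python
import random
from typing import List, Tuple


def balance_classes(reviews: List[Tuple[str, str]], per_class: int,
                    seed: int = 0) -> List[Tuple[str, str]]:
    """Keep the same number of 'helpful' and 'unhelpful' (text, label) pairs."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    helpful = [r for r in reviews if r[1] == "helpful"]
    unhelpful = [r for r in reviews if r[1] == "unhelpful"]
    # Never ask for more reviews than the smaller class can supply.
    n = min(per_class, len(helpful), len(unhelpful))
    balanced = rng.sample(helpful, n) + rng.sample(unhelpful, n)
    rng.shuffle(balanced)
    return balanced
```

Under this sketch, a balanced set of 8,000 reviews would correspond to calling the function with per_class set to 4,000, assuming both classes contain at least that many reviews.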

Table 3 shows the modelling results using the SVM model and the proposed features. Both proposed features
achieve high F1-scores, with the difference that semantic features are superior on search products (video
games), while structural features are superior on experience products (beauty). Across the two datasets,
another conclusion can be drawn: the F1-scores obtained on the video game product review dataset are higher
than those on the beauty product review dataset. This shows that reviews in the search product category are
easier to classify than reviews in the experience product category. This is in line with Malik [30], who
reports that review content indicators are more influential in predicting review helpfulness for search
products than for experience products. Chua and Banerjee [31] also note that the quality of search products is
easier to evaluate before purchase than that of experience products. Furthermore, this study combines semantic
and structural features based on suggestions from previous studies. With the combined feature, the average
score is lower than with either the semantic or the structural features alone, even though the data were
scaled. A possible reason is that the data generated by the merging process are not optimal, for example
because some data are not equal in shape.


Table 3. Modelling results 


Classifier and feature name | Dataset category | F1-score
SVM+Semantic feature | Beauty | 0.774 ± 0.01
SVM+Semantic feature | Video games | 0.825 ± 0.01
SVM+Structural feature | Beauty | 0.780 ± 0.01
SVM+Structural feature | Video games | 0.823 ± 0.01
SVM+Combined feature | Beauty | 0.736 ± 0.01
SVM+Combined feature | Video games | 0.785 ± 0.01


4. CONCLUSION 

Following suggestions from previous research, this study proposed the use of semantic and structural features
in feature extraction to classify reviews as helpful or not. Using the SVM model and carrying out the right
preprocessing stages, semantic features produced a highest F1-score of 0.825 and structural features produced
a highest F1-score of 0.823. As a further novelty, this research also combined the two features. After scaling
the data, the combined feature produced a highest F1-score of 0.785, which is lower than that of either
feature alone. Even so, the two proposed features proved able to predict well which reviews are considered
helpful. The SVM model has also been shown to work well in text classification using semantic, structural, or
combined features. The model is also able to cope with document vectors that are generally sparse (few
non-zero values), where sparse data can increase the time and space complexity of the model. There are some
limitations to our research. Firstly, the data labelling still uses an automatic method based on the number of
helpful votes and total votes received by reviews, following previous research that used the same method.
Manual validation by experts is needed on samples taken from the labelled data. Secondly, for scaling the data
in order to combine the two features, methods other than the standard scaler can be used, because the standard
scaler does not work optimally on the data used: it cannot calculate the mean of data in the form of a
compressed sparse matrix. Lastly, because this research focuses on applying the proposed features, the SVM
model used is a standard one with no modifications. The model could be modified, for example by tuning the
kernel (linear, polynomial, or Gaussian) or the regularization value, since this study only uses the default
rbf kernel and the default regularization parameter value of 1.0.